Retrieval effectiveness study with Farsi language

Authors

  • Mitra Akasereh
  • Jacques Savoy
Abstract

With Farsi as the underlying language and a test collection of 166,774 documents and 100 topics, this experiment evaluates the retrieval effectiveness of different IR models when using a light stemmer and a plural stemmer, as well as n-gram and trunc-n indexing strategies. The impact of stoplist removal is also evaluated. According to the results, DFR-I(ne)C2 is the best-performing model. The proposed light and plural stemmers improve retrieval performance compared to the non-stemming approach. The trunc-4 and trunc-5 indexing strategies also have a positive impact on performance, while 3-grams and trunc-3 have the most negative impact. The results reveal that, for Farsi, stoplist removal plays an important role in improving retrieval performance. A query-by-query analysis shows that extreme results could be avoided by adding extra controls and rules, based on Farsi morphology, to the stemming algorithms.

RÉSUMÉ. Dans le but d'utiliser le persan comme langue de référence, et en utilisant une collection test de 166 774 documents et de 100 requêtes, cette étude évalue la performance des différents modèles de RI sur lesquels sont appliquées diverses stratégies d'indexation et de recherche. De plus, cette étude évalue l'impact de l'élimination de la liste des mots-outils lors de l'indexation. Selon les résultats obtenus, le modèle DFR-I(ne)C2 est le plus performant. L'enracineur léger et l'enracineur pluriel améliorent la performance en comparaison à l'approche sans enracineur. Les stratégies d'indexation, comme tronc-4 et tronc-5, améliorent la performance, alors que les approches comme 3-grams et tronc-3 ont l'impact le plus négatif sur les résultats. Les résultats révèlent que l'élimination de la liste des mots-outils joue un rôle important dans l'amélioration de la performance. L'analyse requête par requête montre qu'il serait possible d'ajouter des règles supplémentaires aux enracineurs, pour éviter des résultats erronés.
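
To make the indexing strategies named in the abstract concrete, here is a minimal Python sketch. It rests on assumptions, since the abstract does not define the strategies: trunc-n is taken to keep the first n characters of each word, n-grams are taken to be overlapping character windows within a word, and the stoplist is a plain set of words dropped before indexing. The function names (`trunc_n`, `char_ngrams`, `index_terms`) are hypothetical and are not the authors' implementation.

```python
def trunc_n(word: str, n: int) -> str:
    """Assumed trunc-n: keep only the first n characters of the word."""
    return word[:n]


def char_ngrams(word: str, n: int) -> list[str]:
    """Assumed n-gram scheme: overlapping windows of n characters.

    Words no longer than n are kept whole so they still yield one term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, strategy: str, n: int,
                stoplist: frozenset = frozenset()) -> list[str]:
    """Turn whitespace-separated text into indexing terms.

    Stoplist words are removed first, then each remaining token is
    truncated ("trunc") or split into character n-grams ("ngram").
    """
    terms = []
    for token in text.split():
        if token in stoplist:
            continue  # stoplist removal happens before term generation
        if strategy == "trunc":
            terms.append(trunc_n(token, n))
        elif strategy == "ngram":
            terms.extend(char_ngrams(token, n))
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return terms


if __name__ == "__main__":
    stop = frozenset({"the", "of"})
    print(index_terms("the retrieval of farsi", "trunc", 4, stop))
    print(index_terms("farsi", "ngram", 3))
```

Python slices strings by Unicode code point, so the same functions apply to Persian-script tokens. A real pipeline would also need tokenization and normalization suited to Farsi, which this sketch leaves out.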

Similar resources

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: the current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...

Assessment of a Modern Farsi Corpus

The development of Language Engineering (LE) and Information Retrieval (IR) applications requires the availability of sizeable, reliable and representative corpora. This paper describes how we have constructed a well-structured 345 MB tagged corpus of news, and presents some useful statistics of this corpus based upon the characteristics of the Farsi language. It also goes into particular detail on...

Designing a Distributed search engine for Farsi/English web pages

In this paper we have tried to model, design and test a prototype Farsi/English search engine. The engine must cope with features of the web such as heterogeneity, volatility and a huge amount of unstructured worldwide information. These features, as well as the rapid advance in technology, challenge the effectiveness of classical Information Retrieval (IR) techniques. Although a g...

Ad Hoc Retrieval with the Persian Language

This paper describes our participation in the Persian ad hoc search task during the CLEF 2009 evaluation campaign. In this task, we suggest using a light suffix-stripping algorithm for the Farsi (or Persian) language. The evaluations based on different probabilistic models demonstrated that our stemming approach performs better than a stemmer removing only the plural suffixes, or statistically bette...

UniNE at CLEF 2008: TEL, and Persian IR

In our participation in this evaluation campaign, our first objective was to analyze retrieval effectiveness when using The European Library (TEL) corpora composed of very short descriptions (library catalog records) and also to evaluate the retrieval effectiveness of several IR models. As a second objective we wanted to design and evaluate a stopword list and a light stemming strategy for the ...

Publication date: 2012